

### Hewlett Packard Enterprise

## Future opportunities in high performance and low power computing with emerging technologies and novel architectures

John Paul Strachan, Hewlett Packard Labs, HPE

LETI Devices Workshop – December 2, 2018

#### Outline

- The rise and demand for efficient accelerators
- The memristor-based accelerator for A.I./Machine Learning
- Future opportunities: brain-inspired approaches and alternatives to quantum computing



#### HW accelerators – increased performance for special cases



#### Unlike before, we work hard for limited performance gains



## Some Key Drivers for Specialization: Data Explosion & AI




## Motivating example: Autonomous/Assisted Driving





## But we need Billions of miles for safety

How many miles (years<sup>a</sup>) would autonomous vehicles have to be driven...

| | (1) without failure to demonstrate with 95% confidence that their failure rate is at most... | (2) to demonstrate with 95% confidence their failure rate to within 20% of the true rate of... | (3) to demonstrate with 95% confidence and 80% power that their failure rate is 20% better than the human driver failure rate of... |
|---|---|---|---|
| (A) 1.09 fatalities per 100 million miles? | 275 million miles (12.5 years) | 8.8 billion miles (400 years) | 11 billion miles (500 years) |

Source: RAND Corp., "Driving to Safety"
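The figure for case (1) can be reproduced with a short calculation. The sketch below assumes fatalities follow a Poisson (rare-event) model, so the chance of zero failures in n miles is exp(-rate × n). The fleet size, duty cycle, and average speed used to convert miles to years are assumptions that do not appear on the slide; they are chosen here because they reproduce the quoted 12.5 years.

```python
import math

def miles_without_failure(rate_per_mile: float, confidence: float = 0.95) -> float:
    """Failure-free miles needed to claim, at the given confidence,
    that the true failure rate is at most rate_per_mile."""
    # Under a Poisson model, P(zero failures in n miles) = exp(-rate * n).
    # Requiring that probability to be <= 1 - confidence gives n.
    return -math.log(1.0 - confidence) / rate_per_mile

# Human-driver fatality rate from the slide: 1.09 per 100 million miles.
HUMAN_FATALITY_RATE = 1.09 / 100e6

# Assumed test fleet (not on the slide): 100 vehicles driving
# 24 hours a day, 365 days a year, at an average of 25 mph.
FLEET_MILES_PER_YEAR = 100 * 24 * 365 * 25

miles = miles_without_failure(HUMAN_FATALITY_RATE)
years = miles / FLEET_MILES_PER_YEAR
print(f"{miles / 1e6:.0f} million miles, {years:.1f} years")
```

Cases (2) and (3) need far more miles because they require observing enough failures to estimate or compare rates, not just an absence of failures.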



Safe, autonomous vehicles depend on billions of miles of <u>simulated</u> driving

Need for accelerators in the Data Center!



#### **Conventional accelerators**

#### **CPU extensions** ISA-level acceleration

- Vector and matrix extensions
- Reduced precision
- Example: ARM SVE2

Figure: SVE vector registers at different widths (128-bit, 256-bit) and a vectorized `daxpy` loop in SVE assembly (`whilelt`, `ld1d`, `fmla`, `st1d`, `incd`, `b.first`).
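The SVE example rests on a predicated, vector-length-agnostic loop: the same code runs on any vector width, and a predicate masks off lanes that would run past the end of the array. The Python sketch below only illustrates that control flow for y = a·x + y; the function name and the `vector_lanes` parameter are hypothetical, and the comments map each step to the SVE mnemonic it loosely corresponds to.

```python
def daxpy_predicated(a, x, y, vector_lanes=4):
    """Compute y[i] += a * x[i], vector_lanes elements per iteration."""
    n = len(x)
    i = 0
    while i < n:
        # Predicate: True for lanes still inside the array (role of `whilelt`).
        active = [i + lane < n for lane in range(vector_lanes)]
        for lane, is_active in enumerate(active):
            if is_active:
                # Load, fused multiply-add, store (roles of `ld1d`, `fmla`, `st1d`).
                y[i + lane] += a * x[i + lane]
        # Advance by one vector's worth of elements (role of `incd`).
        i += vector_lanes
    return y

result = daxpy_predicated(2.0, [1, 2, 3, 4, 5], [10, 10, 10, 10, 10])
```

The result does not depend on `vector_lanes`, which is the point of the vector-length-agnostic style: wider hardware simply needs fewer iterations.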

#### **GPUs** Data parallel calculations

- Optimized for throughput
- High-bandwidth memory
- Example: Nvidia, AMD



#### **Deep Learning Accelerators** ASIC-like flexible performance

- Data-flow inspired, systolic, spatial
- Cost optimized
- Example: Google's TPU, FPGAs





## **Unconventional accelerators**

#### Analog neuromorphic computing

Massive speedup for AI training and inference

- Complex matrix calculations in one step
- 10-100x faster
- 10-1000x more energy efficient (compared to GPU)



#### **Optical Computing**

Designed for "unsolvable" optimization problems

- Harnessing the properties of light at the microscale
- Prototype has a world-record 1,000 optical components
- Scalable to 100,000 components







## The memristor Dot Product Engine (DPE)





- Harness memristors in dense crossbar arrays
- Memristor = non-volatile, analog memory cell
- Parallel activation of every row and column in crossbar
- Vector-matrix multiplication (VMM) in a single cycle
- Computing = read operation
- Efficient multiply & add in analog domain
- Key advantage is <u>in-memory processing</u>
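A minimal numerical sketch of the crossbar operation: each memristor stores a weight as a conductance G, a voltage V applied to a row drives a current V·G through each cell (Ohm's law), and the currents on each column wire add up (Kirchhoff's current law). Reading all columns at once therefore yields a full vector-matrix product. The function name and the example values are hypothetical, and the model is ideal: it ignores wire resistance, noise, and device nonlinearity.

```python
def crossbar_vmm(voltages, conductances):
    """Ideal crossbar read: column current I_j = sum_i V_i * G_ij.

    voltages:     one input voltage per row (volts)
    conductances: conductances[i][j] for the cell at row i, column j (siemens)
    """
    num_cols = len(conductances[0])
    currents = [0.0] * num_cols
    for v, row in zip(voltages, conductances):
        for j, g in enumerate(row):
            currents[j] += v * g  # Ohm's law per cell, summed on the column wire
    return currents

# Hypothetical 2x2 crossbar: 1-4 mS cells, 0.1 V and 0.2 V inputs.
G = [[1e-3, 2e-3],
     [3e-3, 4e-3]]
currents = crossbar_vmm([0.1, 0.2], G)
```

In software this takes one multiply-add per cell; in the crossbar all cells contribute in parallel during a single read, which is why the slide equates computing with a read operation.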

## **Dot Product Engine: working prototype chip**







Successful MNIST Neural Network inference with memristor-based analog computing

M. Hu, et al., Adv. Mater. (2018)





Reduces computations from $O(Cm^2n^2)$ operations to $O(n^2)$

C. Li, et al., Nature Electronics (2018)



## System Architecture, Compiler, & Software Support

- Developed an architecture supporting all state-of-the-art neural networks (CNNs, LSTMs, MLPs, RBMs, etc.)
- Developed an "Assembly" code (ISA) for our memristor accelerator
- Built a compiler, with support for standard ONNX format





 Neural Network specification (ONNX) – CNN, LSTM, etc

#### Compiler

 Convert to DPE Assembly; Map to crossbars

#### Simulator

 Provide performance metrics (accuracy, energy, latency, etc.)
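One part of the "map to crossbars" step can be sketched as tiling: a layer's weight matrix is usually larger than a single crossbar, so it has to be split into crossbar-sized blocks. The sketch below is an illustration under assumptions, not the compiler described here; the function name and the 128×128 crossbar size are hypothetical.

```python
def map_to_crossbars(num_rows, num_cols, crossbar_size=128):
    """Split a num_rows x num_cols weight matrix into crossbar-sized tiles.

    Returns a list of (row_start, row_end, col_start, col_end) ranges,
    end-exclusive. Edge tiles may be smaller than the crossbar.
    """
    tiles = []
    for r in range(0, num_rows, crossbar_size):
        for c in range(0, num_cols, crossbar_size):
            tiles.append((r, min(r + crossbar_size, num_rows),
                          c, min(c + crossbar_size, num_cols)))
    return tiles

# Hypothetical 300 x 200 fully connected layer.
tiles = map_to_crossbars(300, 200)
print(len(tiles), "crossbars needed")
```

Tiles that share the same column range produce partial sums that must be added together afterward, which is one reason a real mapping also has to schedule data movement between crossbars.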

#### Benchmarking

#### Inference energy normalized to PUMA (lower is better)



Larger networks (NMT, WLM) benefit the most


#### Benchmarking

#### Inference latency normalized to PUMA (lower is better)



Lower latency than CPUs (10-10,000x) and NVIDIA GPUs (10-100x)

Larger networks (NMT, WLM) benefit the most





## Future opportunities: brain-inspired approaches as an alternative to quantum computing

#### Memristors also provide neuron-like behavior



Can build a "neuronic" circuit element from a memristor (NbO<sub>2</sub> device shown here)

#### Directly emulates signals seen in brains





## Highly compact artificial neuron



Compared to brain:

- 500x frequency
- 100x less energy/spike
- 100 nm vs 100 µm

Dark-field cross-sectional TEM image of NbO<sub>x</sub> memristor

$R_{th}C_{th} \le 0.1 \text{ ns}$

## **Apply to Important Optimization Problems**

NP-hard and NP-complete problems:

For a problem of size N, running time or memory use grows as ~exp(N)

#### **Important Graph Problems:**

"Set cover" – applies to airline flight scheduling

"Traveling salesman" – applies to UPS, shipping

"Max-cut" – applies to VLSI layout, routing



#### Example:

Every year, the National Football League (NFL) builds its 256-game schedule for the next season

- Have to consider team match-ups, stadium usage by other events, traffic, etc.
- Takes ~3 months on a 1000-core system to solve!

\*Source: Gurobi CEO Edward Rothberg
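To make the exponential growth concrete, here is a minimal sketch of exhaustive search for Max-cut, one of the graph problems listed above: split the nodes of a weighted graph into two groups so that the total weight of edges crossing between the groups is as large as possible. The function name and the tiny example graph are hypothetical. With N nodes there are 2^N assignments to check, so each added node doubles the work.

```python
from itertools import product

def max_cut_brute_force(num_nodes, edges):
    """Try every 2-way split of the nodes; edges are (i, j, weight) tuples.

    Returns (best cut value, best assignment, number of assignments tried).
    """
    best_value, best_assignment = -1, None
    tried = 0
    for assignment in product((0, 1), repeat=num_nodes):
        tried += 1
        # An edge is "cut" when its endpoints land in different groups.
        value = sum(w for i, j, w in edges if assignment[i] != assignment[j])
        if value > best_value:
            best_value, best_assignment = value, assignment
    return best_value, best_assignment, tried

# Hypothetical example: a triangle with unit weights.
triangle = [(0, 1, 1), (1, 2, 1), (0, 2, 1)]
value, assignment, tried = max_cut_brute_force(3, triangle)
```

At 3 nodes this checks 8 assignments; at 100 nodes it would be about 10^30, which is why heuristic and hardware-accelerated solvers are of interest.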






#### **Optimization Accelerator: memristor Hopfield Network**



*S. Kumar, et al., Nature (2017)*

*F. Cai, et al., manuscript in preparation*
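As a rough software illustration of how a Hopfield network can attack an optimization problem, the sketch below applies the textbook discrete Hopfield update to Max-cut. Each node holds a state of +1 or -1 (its side of the cut), and each update moves one node to the side opposite the weighted majority of its neighbors, which never lowers the cut value. This is a generic formulation with hypothetical function names, not the memristor circuit from the cited work, and a deterministic update like this can stall in a local optimum on harder graphs.

```python
def cut_value(edges, state):
    """Total weight of edges whose endpoints have different states."""
    return sum(w for i, j, w in edges if state[i] != state[j])

def hopfield_max_cut(num_nodes, edges, max_sweeps=100):
    """Asynchronous Hopfield updates for Max-cut; edges are (i, j, weight)."""
    neighbors = [[] for _ in range(num_nodes)]
    for i, j, w in edges:
        neighbors[i].append((j, w))
        neighbors[j].append((i, w))

    state = [1] * num_nodes  # start with every node on the same side
    for _ in range(max_sweeps):
        changed = False
        for i in range(num_nodes):
            # Weighted sum of neighbor states (the "local field").
            field = sum(w * state[j] for j, w in neighbors[i])
            # Move to the side opposite the majority; keep state on a tie.
            new_state = -1 if field > 0 else (1 if field < 0 else state[i])
            if new_state != state[i]:
                state[i] = new_state
                changed = True
        if not changed:  # no node wants to flip: a stable state
            break
    return state

# Hypothetical example: a 4-node ring with unit weights.
ring = [(0, 1, 1), (1, 2, 1), (2, 3, 1), (3, 0, 1)]
state = hopfield_max_cut(4, ring)
best = cut_value(ring, state)
```

The dominant cost in each update is the weighted sum over neighbors, which is a vector-matrix product and so the kind of operation a crossbar array computes in a single read.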



## Summary

- The computing world has become **heterogeneous**; there is no turning back
- Big opportunities to speed up applications with significant markets
- You can jump >20 years into the tech future with a special-purpose accelerator
- Harness emerging devices to build new architectures
- But we also need software to rise to the challenge
  - Can't depend on hardware to keep up performance growth
- We must consider system balance (compute, memory bandwidth, cooling)
- We are kicking off a new Cambrian explosion, with plenty of extinctions coming – an exciting time to be designing computing systems!



# Thank you


abs.hpe.com



## Acknowledgments

#### HPE Labs

**Catherine Graves**, Suhas Kumar, Miao Hu, Xia Sheng, Xuema Li, Martin Foltin, Dejan Milojicic, Amit Sharma, Fuxi Cai, Rui Liu

#### **University Collaborators**

Jianhua Yang (UMass Amherst), Qiangfei Xia, Can Li, Aayush Ankit (Purdue), Kaushik Roy, Izzat El Hajj (UIUC), Wen-Mei Hwu, Wei Liu (U Michigan), Shimeng Yu (Georgia Tech)

#### Program support



Karl Roenigk, Jeffrey Weinschenk, Richart Slusher, Chad Meiners (Lincoln Lab), Chris Algire (NGA)

